Prepare Data
This page introduces the functions provided for loading datasets and splitting data.
Load Data
The s3l.datasets.base module provides the following functions for loading data:
'load_data',
'load_dataset',
'load_graph',
'load_boston',
'load_diabetes',
'load_digits',
'load_iris',
'load_breast_cancer',
'load_linnerud',
'load_wine',
'load_ionosphere',
'load_australian',
'load_bupa',
'load_haberman',
'load_vehicle',
'load_covtype',
'load_housing10',
'load_spambase',
'load_house',
'load_clean1'
Among them, load_data, load_dataset and load_graph load data that you prepare yourself. The other functions load built-in datasets that are commonly used by researchers, and they return the data in a form that estimators can use directly. For example,
# XXX is the name of the dataset
X, y = load_XXX(return_X_y=True)
The rest of this section shows how to use the three user-oriented functions load_data, load_dataset and load_graph. load_dataset is called directly by the experiment classes; you can use all three functions when you try algorithms outside an experiment class or when you implement your own experiment class.
load_data loads the features and labels of a dataset from the given file names.
X, y = load_data(feature_file, label_file)
load_dataset wraps load_data with an additional parameter, name, and loads the corresponding built-in dataset if name matches one.
X, y = load_dataset(name, feature_file, label_file)
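The wrapping behaviour described above can be pictured as a simple name lookup with a fallback to files. The following is only a sketch of that pattern, not the library's implementation: BUILTIN_LOADERS, load_from_files, load_by_name and the "toy" dataset are hypothetical names, and the file layout (one comma-separated sample per row, one integer label per line) is an assumption, since the text does not specify the formats load_data accepts.

```python
import csv
from typing import Callable, Dict, List, Optional, Tuple

Data = Tuple[List[List[float]], List[int]]


def _toy_dataset() -> Data:
    # Hypothetical built-in dataset, used only for this illustration.
    return [[0.0, 1.0], [1.0, 0.0]], [0, 1]


# Hypothetical registry mapping dataset names to loader functions.
BUILTIN_LOADERS: Dict[str, Callable[[], Data]] = {"toy": _toy_dataset}


def load_from_files(feature_file: str, label_file: str) -> Data:
    # Assumed layout: one sample per CSV row, one integer label per line.
    with open(feature_file, newline="") as f:
        X = [[float(v) for v in row] for row in csv.reader(f) if row]
    with open(label_file) as f:
        y = [int(line) for line in f if line.strip()]
    if len(X) != len(y):
        raise ValueError("features and labels differ in length")
    return X, y


def load_by_name(name: Optional[str],
                 feature_file: Optional[str] = None,
                 label_file: Optional[str] = None) -> Data:
    # A matching built-in name wins; otherwise fall back to the user's files.
    if name in BUILTIN_LOADERS:
        return BUILTIN_LOADERS[name]()
    if feature_file is None or label_file is None:
        raise ValueError("unknown name %r and no files given" % (name,))
    return load_from_files(feature_file, label_file)
```

With this shape, a caller can pass the same three arguments for built-in and custom datasets, and the file names are ignored whenever the name is recognised.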
load_graph loads a graph from a *.csv, *.npz or *.mat file and returns it as a matrix.
W = load_graph(graph_file)
Split Data
The s3l.datasets.data_manipulate module provides the following functions for splitting data:
'inductive_split',
'ratio_split',
'cv_split'
Among them, inductive_split splits a dataset into three parts: a labeled set, an unlabeled set and a testing set, which is helpful for semi-supervised learning tasks.
from sklearn.datasets import make_classification
from s3l.datasets import data_manipulate
X, y = make_classification()
train_idx, test_idx, label_idx, unlabel_idx = \
    data_manipulate.inductive_split(X, y, test_ratio=0.3,
                                    initial_label_rate=0.05, split_count=10)
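To make the relationship between the four returned values easier to picture, here is a minimal pure-Python sketch of an inductive split. It rests on several assumptions that the text does not confirm: that each returned value holds one index list per split (split_count lists in total), that the labeled and unlabeled sets together form the training set, and that initial_label_rate is a fraction of the training set. The function inductive_split_sketch is a hypothetical name; it ignores the labels y and does no class balancing, so it should not be read as the library's algorithm.

```python
import random
from typing import List, Tuple

Splits = List[List[int]]


def inductive_split_sketch(n_samples: int, test_ratio: float = 0.3,
                           initial_label_rate: float = 0.05,
                           split_count: int = 10, seed: int = 0
                           ) -> Tuple[Splits, Splits, Splits, Splits]:
    rng = random.Random(seed)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    n_test = int(n_samples * test_ratio)
    for _ in range(split_count):
        # Draw a fresh random permutation for every split.
        perm = list(range(n_samples))
        rng.shuffle(perm)
        test, train = perm[:n_test], perm[n_test:]
        # Assumption: the label rate is a fraction of the training part.
        n_label = max(1, int(len(train) * initial_label_rate))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_label])
        unlabel_idx.append(train[n_label:])
    return train_idx, test_idx, label_idx, unlabel_idx


train_idx, test_idx, label_idx, unlabel_idx = inductive_split_sketch(100)
print(len(train_idx), len(train_idx[0]), len(test_idx[0]),
      len(label_idx[0]), len(unlabel_idx[0]))
```

Under these assumptions, the testing indexes never overlap the training indexes, and every training index is either labeled or unlabeled within a given split.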
ratio_split and cv_split split the given data by a train/test ratio and by k-fold cross-validation, respectively.
from sklearn.datasets import make_classification
from s3l.datasets import data_manipulate
X, y = make_classification()
# ratio_split
train_idx, test_idx = \
    data_manipulate.ratio_split(X, y, unlabel_ratio=0.3,
                                split_count=10)
# cv_split
train_idx, test_idx = \
    data_manipulate.cv_split(X, y, k=3, split_count=10)
The returned XXX_idx values are lists of indexes that can be used directly by the built-in estimators.
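As an illustration of k-fold splitting and of how index lists select rows, the sketch below builds the folds by hand in pure Python. The helper kfold_indices is a hypothetical name, not part of s3l, and it produces a single round of k folds; how cv_split combines k with split_count is not described here, so that part is left out.

```python
import random
from typing import List, Tuple


def kfold_indices(n_samples: int, k: int = 3, seed: int = 0
                  ) -> List[Tuple[List[int], List[int]]]:
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    perm = list(range(n_samples))
    random.Random(seed).shuffle(perm)
    # Deal the shuffled indexes into k folds of near-equal size.
    folds = [perm[i::k] for i in range(k)]
    pairs = []
    for i, test in enumerate(folds):
        # Every fold except the i-th one forms the training part.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((train, test))
    return pairs


# Toy data: six samples with one feature each.
X = [[0.1], [0.2], [0.9], [1.1], [1.0], [0.3]]
y = [0, 0, 1, 1, 1, 0]

fold_sizes = []
for train, test in kfold_indices(len(X), k=3):
    # Index lists pick out the rows that belong to each part.
    X_train = [X[i] for i in train]
    y_train = [y[i] for i in train]
    X_test = [X[i] for i in test]
    fold_sizes.append((len(X_train), len(X_test)))
print(fold_sizes)
```

When X and y are NumPy arrays, the same selection can be written as X[train] and y[train], because NumPy accepts a list of integer indexes.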